<HTML>
<HEAD>
<TITLE>SEQIO.DOC  -  The SEQIO Package Interface</TITLE>
<owner_name="James Knight, knight@cs.ucdavis.edu">
<LINK REV="made" HREF="mailto:knight@cs.ucdavis.edu">
</HEAD>

<BODY>

<I><A HREF="seqio.html">SEQIO -- A Package for Sequence File I/O</A></I>
<HR>

<P>
<H1>SEQIO.DOC  -  The SEQIO Package Interface</H1>

<HR>

<P>
The SEQIO package is a set of C functions which can read and write
biological sequence files formatted using various file formats and
which can be used to perform database searches on biological
databases.  I had five main goals in designing this package:
<P>
<OL>
<LI>
Keep the interface as similar as possible to the C stdio package,
given the fact that we're dealing with sequences and sequence entries
instead of characters and lines.
<LI>
Support as many sequence file formats as possible, and make it
relatively quick and easy (at least for me) to add new formats.
<LI>
Handle genome-sized sequences and large database searches, so that the
file I/O is no longer a time or space bottleneck for a program.
<LI>
Make the package flexible enough so that sequence analysis,
information retrieval and associating new information with sequences
(such as the results of the analyses) is easy.
<LI>
Try to help people do "something else" with the minimum of hassle
(where "something else" means getting information or performing some
computation that no one else has provided software for).
</OL>
The package defines a SEQFILE data structure, similar to stdio's FILE
data structure, that is opened and closed and is used to perform the
reading and writing of the files.  Sequences and sequence entries are
read or written one at a time, as a stream.

<P>
Because of the size and complexity of sequences and entries, internal
SEQIO data structures always retain the last read, or "current",
sequence and its entry.  A number of access functions are provided
to return information about that current sequence and entry.  So, you
could read the next sequence, examine it, and then use the access
functions to look at the database identifiers for the sequence, and
then get the complete entry text.  This is different from the stdio
package, which simply puts the characters into data structures that
you create and then forgets about those characters.

<P>
The package can retrieve a number of pieces of information stored in a
sequence entry, although it can't get everything (yet).  In addition
to the sequence and the complete entry text, it can get things like
the database identifiers, the description, the organism name, the
entry creation/update date, and so on.  A SEQINFO data structure
(listed below in the <A HREF="#seqinfo">Current Sequence Access
Functions</A> section) has been defined to hold all of this
information about a sequence or entry.  This is a transparent data
structure, unlike the SEQFILE structure, so that you can access and
modify the fields of the structure or create your own structures.
Since the information in the SEQINFO structure is used along
with the sequence when writing sequence entries, creating new
sequence entries is as simple as filling in the
fields of a SEQINFO structure and then passing it and the sequence to
the function `seqfwrite'.

<P>
The package can perform database searches using the BIOSEQ standard.
BIOSEQ was developed as a successor to the FASTLIBS method of the
FASTA program for specifying the files to be read.  The user (not
you, but the user of your program) creates a BIOSEQ file with entries
describing the various database, like this one for GenBank:
<PRE>
&gt;genbank,gb:   /databases/genbank
&gt;Name: GenBank
&gt;Alphabet:  DNA
    gbbct.seq, gbest.seq, gbinv.seq, gbmam.seq, gbpat.seq, gbphg.seq
    gbpri.seq, gbrna.seq, gbrod.seq, gbsts.seq, gbsyn.seq, gbuna.seq
    gbvrl.seq, gbvrt.seq

    bct:(gbbct.seq), est:(gbest.seq), inv:(gbinv.seq),  mam:(gbmam.seq)
    pat:(gbpat.seq), phg:(gbphg.seq), pri:(gbpri.seq),  rna:(gbrna.seq)
    rod:(gbrod.seq), sts:(gbsts.seq), syn:(gbsyn.seq),  una:(gbuna.seq)
    vrl:(gbvrl.seq), vrt:(gbvrt.seq)
</PRE>
The SEQIO package can read this file and the BIOSEQ related procedures
in the package, seqfopendb, bioseq_read, bioseq_info and bioseq_parse,
can be used to perform database searches and to retrieve information
about a database or the location of its files.  The full details of
the BIOSEQ standard are found in the file
"<A HREF="seqio_user.html#bioseq">user.doc</A>".


<P>
<H2>General Comments</H2>

As you read this file and use the package, here are some issues to
keep in mind:
<P>
<OL>
<LI>
How are returned values allocated?  The SEQIO package either returns
pointers to internal data structures (which may change after the next
SEQIO call) or returns malloc'ed buffers which you must free to avoid
a memory leak.

<P>
<LI>
How are errors handled?  The SEQIO package always sets an error
variable `<A HREF="#seqferrno">seqferrno</A>' and error string
`<A HREF="#seqferrstr">seqferrstr</A>' on an error.  Normally, the
package also reports warnings and errors on standard error, and will
exit the program on fatal errors (like running out of memory).  Using
`<A HREF="#seqferrpolicy">seqferrpolicy</A>' though, you can keep the
package from doing any or all of that (so that it only sets the error
variables and returns an error return value).  And using `<A
HREF="#seqfsetperror">seqfsetperror</A>', you can replace the
package's default print error function (which outputs to standard
error) with your own output function. 

<P>
<LI>
File formats are always specified by name, such as "GenBank", "FASTA",
"Stanford", and so on.

<P>
<LI>
Except for filenames, just about everything is case-insensitive (so
"GENBANK", "fasta" and "sTAnForD" are acceptable in point 3).

<P>
<LI>
One deficiency you might find in the package is that there is no
explicit link between a sequence and the SEQINFO structure for that
sequence.  I could not reconcile the package's ability to create new,
malloc'ed sequence and SEQINFO buffers, my wish to keep the amount of
allocated space to a minimum, and any mechanism for linking the SEQINFO
structure with a sequence.  You'll have to keep track of the sequences
and SEQINFO structures, and make sure that you are not mixing them up.

<P>
<LI>
Most of the names in the package should be fairly unique, except
possibly the predefined constants DNA, RNA, PROTEIN, AMINO and UNKNOWN
used to specify alphabets.  File 
"<A HREF="seqio_qref.html">quickref.doc</A>" gives a complete list of
the constants, typedef names and functions names defined by the
package. If these alphabetic constants interfere with the constants in
your program, just add the following lines AFTER including "seqio.h",
and then use the constants SEQIO_DNA, SEQIO_RNA, SEQIO_PROTEIN,
SEQIO_AMINO and SEQIO_UNKNOWN when looking at SEQIO's alphabet values.
<PRE>
#undef DNA
#undef RNA
#undef PROTEIN
#undef AMINO
#undef UNKNOWN

#define SEQIO_UNKNOWN 0
#define SEQIO_DNA 1
#define SEQIO_RNA 2
#define SEQIO_PROTEIN 3
#define SEQIO_AMINO 3
</PRE>
<P>
Or you can just use the constants predefined in the package in your
program (you should be able to figure out what values they have from
above).
</OL>

<HR>

<P>
<H1><A NAME="opening">Opening and Closing Files/Database-Searches</A></H1>

<P>
<H2><A NAME="seqfopen">Seqfopen</A></H2>

<blockquote>
SEQFILE *seqfopen(char *filename, char *mode, char *format)
<P>
<UL>
<LI> filename - the file to be opened
<LI> mode - "r", "w" or "a"
<LI> format - the file format name (optional for reading)
<P>
<LI> returns an open SEQIO file structure (or NULL on error)
</UL>
</blockquote>

The seqfopen function opens a file for reading or writing.  It is
similar to stdio's fopen, except it returns a SEQFILE structure and
it has an extra `format' argument.  This argument specifies the
format of the file to be read or written.  See the files
"<A HREF="seqio_user.html#formats">user.doc</A>" and
"<A HREF="seqio_format.html">format.doc</A>" for a description of the
valid formats. 

<P>
In addition to normal filenames, the package recognizes several
special characters.  First, the string containing only a dash ("-")
specifies that standard input or standard output should be opened
instead.  Second, filenames beginning with a `~' are treated the same
way as the Unix shells (i.e., the `~' is used to refer to home
directories).  Finally, access to single entries of the file can be
specified using an `@', followed by a single entry access
specification.  See "<A HREF="seqio_user.html#access">user.doc</A>"
for a description of how to specify single entry access.<BR> 
<I>(Note: This specification means that seqfopen will not accept
filenames containing an ampersand character.  It will always assume
the `@' denotes a single entry access specification.)</I>

<P>
When reading a file, the `format' argument can be NULL, in which case
seqfopen tries to automatically determine the format of the file.  If
the file format is any of the valid formats except the Raw format,
seqfopen should be able to determine the correct format. If it cannot
determine the correct format, the SEQIO package does not fail to open
the file.  It simply triggers a warning and opens the file in the
Plain format.


<P>
<H2><A NAME="seqfopendb">Seqfopendb</A></H2>

<blockquote>
SEQFILE *seqfopendb(char *dbspec)
<P>
<UL>
<LI> dbspec - a BIOSEQ database search specification
<P>
<LI> returns an open SEQIO file structure (or NULL on error)
</UL>
</blockquote>

The seqfopendb function opens a SEQFILE structure to perform a
database search.  The one argument specifies what database (or part of
a database) to search.  All of the sequences in that database (or
database part) can then be read as if they were stored in a single
sequence file.  See the file "<A
HREF="seqio_user.html#access">user.doc</A>" for a description of a
valid BIOSEQ database specification. 

<P>
This function also looks for five information fields from the BIOSEQ
entry for the database.  These fields are not required to occur in an
entry.  The "Name" field specifies the name of the database described
by that BIOSEQ entry and is used to distinguish between "official"
databases and just collections of entries.  The "IdPrefix" field gives
the identifier prefix to attach to any identifier in an entry which
does not already have a prefix.  The "Format" field specifies the file
format to pass to `seqfopen' for each database file read.  The
"Alphabet" field specifies the alphabet for each sequence in the
database.  And finally, the "Index" field gives the name of the file
indexing all of the database's entries (which is used to randomly
access the entries).


<P>
<H2><A NAME="seqfopen2">Seqfopen2</A></H2>

<blockquote>
SEQFILE *seqfopen2(char *string)
<P>
<UL>
<LI> string - the filename (if it specifies an existing file) or
database search specifier (otherwise)
<P>
<LI> returns an open SEQIO file structure (or NULL on error)
</UL>
</blockquote>

The seqfopen2 function is a "combination" function which you can use
to avoid having to decide whether to call seqfopen or seqfopendb.  If
the parameter specifies an existing file (tested using the stat system
call), then seqfopen is called (with the second and third arguments of
"r" and NULL, resp.).  Otherwise, seqfopendb is called.


<P>
<H2><A NAME="seqfclose">Seqfclose</A></H2>
 
<blockquote>
void seqfclose(SEQFILE *sfp)
<P>
<UL>
<LI> sfp - the SEQFILE structure to be closed
<P>
<LI> returns nothing
</UL>
</blockquote>

The seqfclose function closes an open SEQFILE structure, closing any
open FILE structures it uses and freeing up the memory allocated to
it.

<P>
<HR>

<P>
<H1><A NAME="reading">Reading Sequences/Entries</A></H1>

<P>
<H2><A NAME="seqfread">Seqfread</A></H2>

<blockquote>
int seqfread(SEQFILE *sfp, int flag)
<P>
<UL>
<LI> sfp - an open SEQFILE structure
<LI> flag - read the next sequence (if zero) or entry (non-zero)
<P>
<LI> returns 0 on success and -1 on EOF or error
</UL>
</blockquote>

The seqfread function reads the next sequence or entry.  The
difference between reading the next sequence and reading the next
entry only appears with formats that contain multiple sequences per
entry, such as the PHYLIP or Clustalw formats.  When reading "by
sequence", each sequence of a multiple sequence entry is read and
kept as the current sequence.  If seqfread is called with a non-zero
`flag' argument, then the next entry is always read, even if some
sequences of the current entry have not been read.

<P>
This function only returns a status value because the current sequence
and entry text are kept in the internal SEQIO data structures.  The
function just changes the value of the current sequence and current
entry.


<P>
<H2><A NAME="seqfgetseq">Seqfgetseq</A>,
<A NAME="seqfgetrawseq">Seqfgetrawseq</A>,
<A NAME="seqfgetentry">Seqfgetentry</A>,
<A NAME="seqfgetinfo">Seqfgetinfo</A></H2> 

<blockquote>
char *seqfgetseq(SEQFILE *sfp, int *length_out, int newbuffer)<BR>
char *seqfgetrawseq(SEQFILE *sfp, int *length_out, int newbuffer)<BR>
char *seqfgetentry(SEQFILE *sfp, int *length_out, int newbuffer)<BR>
SEQINFO *seqfgetinfo(SEQFILE *sfp, int newbuffer)
<P>
<UL>
<LI> sfp - an open SEQFILE structure
<LI> length_out - address where the returned string's length is
stored (if not NULL)
<LI> newbuffer -  malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
<P>
<LI> returns the sequence/entry text or the SEQINFO structure (or NULL
on error)
</UL>
</blockquote>

These functions simply call seqfread to read the next sequence or
entry, and then call the access function to return the sequence text,
entry text or sequence information.  The seqfgetseq, seqfgetrawseq and
seqfgetinfo functions call seqfread with a `flag' value of 0, while
the seqfgetentry function calls seqfread with a `flag' value of 1.
This way, the search by entry will look at each entry exactly once (it
won't repeatedly look at the same multiple sequence entry), and the
search by sequence and search by information will look at each
sequence exactly once.

<P>
See the access functions description next for a more complete
description of the arguments and their use.

<P>
<HR>

<P>
<H1>Access Functions for the Current Sequence, Entry and Information</H1>

<P>
<H2><A NAME="seqfsequence">Seqfsequence</A>,
<A NAME="seqfrawseq">Seqfrawseq</A>,
<A NAME="seqfentry">Seqfentry</A>,
<A NAME="seqfinfo">Seqfinfo</A>,
<A NAME="seqfallinfo">Seqfallinfo</A></H2> 

<blockquote>
char *seqfsequence(SEQFILE *sfp, int *length_out, int newbuffer)<BR>
char *seqfrawseq(SEQFILE *sfp, int *length_out, int newbuffer)<BR>
char *seqfentry(SEQFILE *sfp, int *length_out, int newbuffer)<BR>
SEQINFO *seqfinfo(SEQFILE *sfp, int newbuffer)<BR>
SEQINFO *seqfallinfo(SEQFILE *sfp, int newbuffer)
<P>
<UL>
<LI> sfp - an open SEQFILE structure
<LI> length_out - address where the returned string's length is
stored (if not NULL)
<LI> newbuffer -  malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
<P>
<LI> returns the sequence/entry text or the SEQINFO structure (or NULL
on error)
</UL>
</blockquote>

These functions are the access functions by which the current sequence
text, the current entry text or the information about the current
sequence can be retrieved.  The seqfinfo and seqfallinfo functions are
described in the next section.

<P>
The seqfsequence, seqfrawseq and seqfentry functions return the
current sequence text, the raw sequence text or the current entry
text, respectively.  They also return the length of the text, if the
second argument is an integer variable pointer (such as `<CODE>s =
seqfsequence(sfp, &amp;len, 0)</CODE>').

<P>
(NOTE: The function seqfrawseq differs from seqfsequence in that it
also includes any alignment or notational characters in its returned
"sequence".  Typically, seqfsequence only extracts alphabetic
characters in the sequence lines of the entry, whereas seqfrawseq
extracts all characters except whitespace and digits.)

<P>
The returned text is stored either in an internal SEQIO buffer or in a
newly malloc'ed buffer, depending on the value of the `newbuffer'
argument.  With both kinds of buffers, you are permitted to modify or
rewrite the characters of the text as needed (these are NOT read-only
buffers), with two exceptions.  First, you may not write past the end
of the returned text, because you would be writing off the end of a
malloc'ed buffer or onto other text stored in an internal SEQIO
buffer.  So, you can make the string shorter, but not longer.

<P>
Second, if you use the internal buffer, any permanent changes you make
could affect future calls involving that sequence or entry (what is
being returned is the buffer storing the only copy SEQIO has of the
sequence/entry).  The idea is to use the internal buffers when the
sequence/entry will only be kept around temporarily and any changes
made are undone when the text is no longer needed, and to use the
malloc'ed buffer for the sequences/entries that must be kept around
for a long time.


<P>
<H2><A NAME="seqinfo">The SEQINFO Structure</A></H2>

The seqfinfo functions returns a SEQINFO structure containing various
information about the current sequence and its entry.  So, what
information does the SEQINFO structure hold?  Here is the C
definition:
<PRE>
typedef struct {
  char *dbname, *filename, *format;
  int entryno, seqno, numseqs;

  char *date, *idlist, *description;
  char *comment, *organism, *history;
  int isfragment, iscircular, alphabet;
  int fragstart, truelen, rawlen;
} SEQINFO;
</PRE>
The structure contains six fields which the SEQIO package has about
the current sequence:
<DL>
<DT> dbname
<DD>
The name of the database being searched (if this is a search of an
actual database).
<DT> filename
<DD>
The name of the file currently being read.
<DT> format
<DD>
The format of the file (and the current entry).
<DT> entryno
<DD>
The location of the current entry in the file (if entryno is 10, then
the current entry is the tenth entry in the file).
<DT> seqno
<DD>
The location of the current sequence in the current entry (if seqno is
3, then the current sequence is the third in the current entry).
<DT> numseqs
<DD>
The number of sequences contained in the current entry.
</DL>
So, the current sequence's location is the `seqno' sequence of the
`entryno' entry of the file `filename' (possibly of the database
`dbname'). The `format' string gives the entry's format, and the entry
contains `numseqs' sequences.

<P>
The other twelve fields are information extracted from the current
entry (see "<A HREF="seqio_format.html">format.doc</A>" for the
details about which information is retrieved for each file format):
<DL>
<DT> date
<DD>
A single date giving the last time the entry was either created or
updated.  Its format should be day-month-year, as in 31-JAN-1995.
<DT> idlist
<DD>
The list of identifiers given in the entry.  The idlist's form is a
string containing vertical bar separated list of identifiers, each of
whose form consists of an identifier prefix, a ':' and the identifier.
See file "<A HREF="seqio_user.html#idents">user.doc</A>" for more
information about identifiers and identifier prefixes.
<DT> description
<DD>
A description of the sequence or sequences in the entry.  This is the
"Title" or "Definition" line in some file formats.  This string should
consist of a single "line" of text, although it can be of any length.
So, no newlines should appear in this text (they are removed and added
when the description is read from and output in the sequence entries).
<DT> comment
<DD>
A block of text giving a comment about the sequence.  The string can
contain one or more lines of any length.  The one restriction to the
text appearing in a comment is that any block of lines at the end of
an entry's comment section where each line begins with the string
"SEQIO" is reserved for other use by the package (this block holds
extra identifiers or the `history' lines).
<DT> organism
<DD>
The name of the organism the sequence was taken from.  Right now, this
field can contain any single "line" of text, although I would like to
standardize the contents of this field.  It's on my TODO list.
<DT> history
<DD>
This holds the lines of text placed in the comment section of entries
which describe previous SEQIO operations on this entry, i.e., it holds
the history of alterations and updates made to this entry by programs
using the SEQIO package.  Any block of lines at the end of a comment
section where each line begins with the string "SEQIO" is not
considered part of the comment, but part of the history.
<DT> isfragment
<DD>
This integer is non-zero if the sequence is a fragment of a larger
sequence, and zero if the sequence is complete (or if it is not known
whether the sequence is a fragment).
<DT> iscircular
<DD>
This integer is non-zero if the sequence is a circular sequence, and
zero if it is a linear sequence (or if it's circularity is not known).
<DT> alphabet
<DD>
This integer is one of the predefined constants DNA, RNA, PROTEIN or
UNKNOWN.  Its value is UNKNOWN unless either the database's BIOSEQ
entry (information field "Alphabet") or the entry itself explicitly
specifies the alphabet.  The package does not try to guess the
alphabet.
<DT> fragstart
<DD>
When the sequence is a fragment of a larger sequence and the location
of this fragment in the larger sequence is known, this value gives the
starting position of the fragment.  If this value is not known (or the
sequence is complete), fragstart is set to 0.
<DT> truelen
<DD>
This is the "true" length of the sequence, i.e., the length of the
sequence without any gap characters or notational characters.
Typically, these are just the alphabetic characters.
<DT> rawlen
<DD>
This is the "raw" length of the sequence, i.e., the length of the
sequence which includes the gap and notational characters.  Typically
these are all characters except whitespace and digits.
</DL>

The seqfinfo function fills in as many fields of the SEQINFO structure
as it can, given the information in the entry.  If a particular piece
of information could not be found in the entry, that field is set
either to NULL or 0, depending on whether it is a character string or
an integer.  (And, yes, I know NULL and 0 are really the same value.
I'm talking about good programming style here.)

<P>
The seqfallinfo function also returns a SEQINFO structure and the
information is the same as that returned by seqfinfo, with the
exception of the `comment' field.  With seqfallinfo, the entire header
for the current entry is stored in the comment field of the SEQINFO
structure.  The meaning of "entire header" is different for the
different formats, but the general idea is that it consists of
everything in the entry except the sequence lines.  This can be useful
for converting from one format to another without losing the header
lines describing the references and features of the sequence.

<P>
As with the seqfsequence and seqfentry, the structure returned by
these two functions is either an internal SEQIO structure or a
malloc'ed structure.  However, the malloc'ed structure is a bit more
complicated here, since the SEQINFO structure contains character
strings for some of its fields.  What the SEQIO package does is malloc
one big buffer in which it stores the SEQINFO structure and all of the
character strings that its fields point to.  So, when you call free to
"free" the SEQINFO structure, you also automatically free all of the
character strings.  This means that the same restrictions specified
for the returned sequence/entry text above also apply to these
character strings.

<P>
<H2><A NAME="fieldfns">SEQINFO Field Access Functions</A></H2>

<blockquote>
char *seqfdbname(SEQFILE *sfp, int newbuffer)<BR>
char *seqffilename(SEQFILE *sfp, int newbuffer)<BR>
char *seqfformat(SEQFILE *sfp, int newbuffer)<BR>
int seqfentryno(SEQFILE *sfp)<BR>
int seqfseqno(SEQFILE *sfp)<BR>
int seqfnumseqs(SEQFILE *sfp)<BR>
char *seqfdate(SEQFILE *sfp, int newbuffer)<BR>
char *seqfidlist(SEQFILE *sfp, int newbuffer)<BR>
char *seqfdescription(SEQFILE *sfp, int newbuffer)<BR>
char *seqfcomment(SEQFILE *sfp, int newbuffer)<BR>
char *seqforganism(SEQFILE *sfp, int newbuffer)<BR>
int seqfisfragment(SEQFILE *sfp)<BR>
int seqfiscircular(SEQFILE *sfp)<BR>
int seqfalphabet(SEQFILE *sfp)<BR>
int seqffragstart(SEQFILE *sfp)<BR>
int seqftruelen(SEQFILE *sfp)<BR>
int seqfrawlen(SEQFILE *sfp)
<P>
<UL>
<LI> sfp - an open SEQFILE structure
<LI> newbuffer -  malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
<P>
<LI> returns the information string or integer
</UL>
</blockquote>

These are the access functions used to retrieve individual pieces of
information about the current sequence and current entry.  Like
seqfsequence and seqfentry, the functions which return strings can
return either an internal SEQIO buffer or a malloc'ed buffer
containing the string (with the same restrictions and requirements on
its use).

<P>
The functions returning strings all return NULL on an error.  The
other access functions all return 0 on an error, even though 0 may be
a valid return value (such as in seqfiscircular).  Note that the
alphabet UNKNOWN is defined to be 0.

<P>
<H2><A NAME="seqfmainid">Seqfmainid</A>,
<A NAME="seqfmainacc">Seqfmainacc</A></H2>

<blockquote>
char *seqfmainid(SEQFILE *sfp, int newbuffer)<BR>
char *seqfmainacc(SEQFILE *sfp, int newbuffer)
<P>
<UL>
<LI> sfp - an open SEQFILE structure
<LI> newbuffer -  malloc a new buffer for the object (if non-zero) or
return an internal buffer (if zero)
<P>
<LI> returns the identifier string or NULL
</UL>
</blockquote>

These functions access the idlist information collected from an entry
and return either the main identifier or main accession number
occurring in the entry.  The main identifier is considered to be
either the first identifier which is not an accession number, or the
first accession number if no non-accession identifiers are found in
the entry.  Thus, seqfmainid is NULL only when no identifiers could be
extracted from the entry.  The main accession number is the first
accession number found.

<P>
The string returned by these functions consists of a single
identifier, whose form is the same as each identifier in the idlist
(i.e., an identifier prefix, a `:' and the identifier string).
This string will be NULL-terminated (it is not just a pointer into the
idlist).  And, as with all of the other information strings, it can be
returned either in an internal buffer or a newly malloc'ed buffer.

<P>
<H2><A NAME="seqfoneline">Seqfoneline</A></H2>

<blockquote>
int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly)
<P>
<UL>
<LI> info - a SEQINFO structure
<LI> buffer - the buffer to store the oneline description
<LI> buflen - the buffer length
<LI> idonly - only store an identifier for the sequence
<P>
<LI> returns the length of the string stored in buffer
</UL>
</blockquote>

This function is used to construct a "oneline" description of a
sequence, based on the information stored in the SEQINFO structure.
See "<A HREF="seqio_user.html#one-line">user.doc</A>" for a complete
description of the format for a oneline description.  That description
is stored in the character array specified by the "buffer" argument.

<P>
The function guarantees that it will fit the oneline description into
the first "buflen-1" characters of buffer, and that the description
will always be NULL-terminated.  So, you won't have to check for a
buffer overflow (like you do with function `fgets').

<P>
If the "idonly" value is non-zero, then the oneline description will
consist only of a single identifier for the sequence, and the
description will not contain any whitespace.  This is useful
for constructing short identifiers for sequences (most notably, the
identifiers used in the PHYLIP, Clustalw and MSF formats).

<P>
<H2><A NAME="seqfsetidpref">Seqfsetidpref</A>,
<A NAME="seqfsetdbname">Seqfsetdbname</A>,
<A NAME="seqfsetalpha">Seqfsetalpha</A></H2>

<blockquote>
void seqfsetidpref(SEQFILE *sfp, char *idprefix)<BR>
void seqfsetdbname(SEQFILE *sfp, char *dbname)<BR>
void seqfsetalpha(SEQFILE *sfp, char *alphabet)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for reading
<LI> idprefix - the identifier prefix (if not NULL or not empty)
<LI> dbname - the current database name(if not NULL or not empty)
<LI> alphabet - the string used to determine the alphabet when the
entry does not specify an alphabet (if not NULL or not empty)
<P>
<LI> returns nothing
</UL>
</blockquote>

These are functions which allow you to set the identifier prefix,
database name, and alphabet when reading a sequence file or performing
a database search.  The idea is that these functions provide the same
capability as the inclusion of the "Name", "IdPrefix" and "Alphabet"
information fields in the BIOSEQ entry used by a database search.  The
use of these values in a database search is described in the comments
for `<A HREF="#seqfopendb">seqfopendb</A>' above and in file
"<A HREF="seqio_progr.html#bioseq">programr.doc</A>" in the BIOSEQ Stuff
section.

<P>
If the second argument to the function is either NULL or an empty
string, then the current value held in the SEQFILE structure is
removed.  This gives a way to unset any of those values as needed.

<P>
(NOTE: The `alphabet' argument to "seqfsetalpha" is a character string,
not one of the predefined constants RNA, DNA, PROTEIN or UNKNOWN.  The
reason for doing that is so that any string specifying a valid
alphabet (i.e., strings like "tRNA", "Peptide", "cDNA", "pre-mRNA")
can be input to the package, but the value set in the SEQINFO
structure is simplified so that programs who just need to know whether
the sequence is DNA, RNA or PROTEIN can just test the integer SEQINFO
field.

<P>
One of the things on my TODO list is to add an `alphastr' field
to the SEQINFO structure to return the actual alphabet string that
occurs in the entry or is given to the package.)

<P>
<HR>

<P>
<H1><A NAME="writing">Writing Sequences/Entries</A></H1>

<P>
<H2><A NAME="seqfwrite">Seqfwrite</A></H2>

<blockquote>
int seqfwrite(SEQFILE *sfp, char *seq, int seqlen, SEQINFO *info)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for writing
<LI> seq - the sequence
<LI> seqlen - the sequence length
<LI> info - information about the sequence
<P>
<LI> returns 0 on success and -1 on error
</UL>
</blockquote>

The seqfwrite function outputs a sequence entry using the sequence and
information given in the function arguments.  Only the twelve "entry
information" fields of the SEQINFO structure (date, mainid, mainacc,
idlist, description, comment, organism, history, isfragment,
iscircular, alphabet, truelen) are used when outputting the entry.

<P><I>
(NOTE: For the ASN.1, PHYLIP and Clustalw formats, remember that it is
important that you call seqfclose to close the file being written, as
the package performs some output during that close operation.)</I>


<P>
<H2><A NAME="seqfconvert">Seqfconvert</A></H2>

<blockquote>
int seqfconvert(SEQFILE *input_sfp, SEQFILE *output_sfp)
<P>
<UL>
<LI> input_sfp - a SEQFILE structure open for reading
<LI> output_sfp - a SEQFILE structure open for writing
<P>
<LI> returns 0 on success and -1 on error
</UL>
</blockquote>

The seqfconvert function retrieves the sequence and information about
the current sequence of `input_sfp' and then calls seqfwrite with 
`output_sfp', the sequence and the information.


<P>
<H2><A NAME="seqfputs">Seqfputs</A></H2>

<blockquote>
int seqfputs(SEQFILE *sfp, char *s, int len)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for writing
<LI> s - the string to output
<LI> len - the number of chars to output (or 0, specifying to output
to the end of s)
<P>
<LI> returns 0 on success and -1 on error
</UL>
</blockquote>

The seqfputs function outputs a string on the output stream opened for
the SEQFILE structure.  The purpose of this function is to give a way
to mix the output of complete entries with the output of entries
produced by seqfwrite or seqfannotate.  Thus, programs can take
differently formatted entries as input (i.e., a combination of
GenBank, FASTA and EMBL entries), and transform them into a single
output file format without losing any information in input entries
whose format matches the output format (i.e., output all in EMBL
format using seqfwrite on the GenBank and FASTA entries and seqfputs
on the EMBL entries).

<P>
The function makes no checks on the string to ensure that the output
consists of a single format.  When using this function, you must
ensure that the output consists of complete entries of a single
format (since you can use seqfputs to output entries line by line).

<P>
The `len' argument either specifies the number of characters to output
(if non-zero), or that the complete string should be output (if zero).
If the length is non-zero, exactly that many characters will be
written to the output.

<P>
<H2><A NAME="seqfannotate">Seqfannotate</A></H2>

<blockquote>
int seqfannotate(SEQFILE *sfp, char *entry, int entrylen, char *newcomment,
                 int flag)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for writing
<LI> entry - the entry text to output
<LI> entrylen - the length of the entry text
<LI> newcomment - the comment to add to the entry
<LI> flag - remove existing comments (if zero) or append the new
comment (if non-zero)
<P>
<LI> returns 0 on success and -1 on error
</UL>
</blockquote>

The seqfannotate function adds extra comment text to an entry as it
outputs the entry.  This way, you can insert new information into an
entry without losing any of the information in the entry.  The new
text will be output as part of the entry's comment, so that when the
output entry is read again, seqfcomment can be used to access the
inserted text.  Also, the `flag' argument can be set to strip out any
existing comments, so that when the output entry is read, the string
returned by seqfcomment is just that inserted text.  This should
provide an easy method for running a number of experiments on
sequences, while storing the results of those experiments in the
sequences' entries.

<P>
The format for the entry must be the same as the format specified when
the SEQFILE structure was opened for writing.  Any mismatch in the two
formats will result in a parse error.

<P>
See file "<A HREF="seqio_progr.html#annotate">programr.doc</A>" for a
description of how this function can be used.

<P>
<H2><A NAME="seqfgcgify">Seqfgcgify</A></H2>

<blockquote>
int seqfgcgify(SEQFILE *sfp, char *entry, int entrylen)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for writing
<LI> entry - the entry text to convert to the GCG format
<LI> entrylen - the length of the entry text
<P>
<LI> returns 0 on success and -1 on error
</UL>
</blockquote>

This function converts an entry from its non-GCG format into its GCG
format, without losing any of the header line information.  This
operation will work only for the GenBank, PIR, EMBL, Swiss-Prot,
FASTA, NBRF or IG/Stanford formats.  Also, the SEQFILE structure must
have been opened either with the generic "GCG" format, or the "GCG-*"
format matching the format of the entry.  Any mismatch in formats will
result in a parse error.


<P>
<H2><A NAME="seqfungcgify">Seqfungcgify</A></H2>

<blockquote>
int seqfungcgify(SEQFILE *sfp, char *entry, int entrylen)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for writing
<LI> entry - the entry text to convert to the GCG format
<LI> entrylen - the length of the entry text
<P>
<LI> returns 0 on success and -1 on error
</UL>
</blockquote>

This function converts an entry from its GCG format into its non-GCG
format, without losing any of the header line information.  This
operation will work only for the GenBank, PIR, EMBL, Swiss-Prot,
FASTA, NBRF or IG/Stanford formats.  Also, the SEQFILE structure must
have been opened the non-GCG format, and the format of the entry must
be the corresponding "GCG-*" format.  Any mismatch in formats will
result in a parse error.


<P>
<HR>

<P>
<H1><A NAME="bioseq">BIOSEQ Database Functions</A></H1>

<P>
<H2><A NAME="bioseq_read">Bioseq_read</A></H2>

<blockquote>
int bioseq_read(char *filelist)
<P>
<UL>
<LI> filelist - a comma separated list of files (must be BIOSEQ files)
<P>
<LI> returns 0 on success and -1 on error
</UL>
</blockquote>

The bioseq_read function reads all of the files in the comma separated
list of files and adds the BIOSEQ entries it reads to the internal
list of entries.  These new entries are added to the front of the
list, so that the newer entries will always be found before the older
entries.  Thus, you should call bioseq_read with the files in
increasing priority (the entries you want examined first should be the
last entries added).

<P>
By default, the files specified by the environment variable "BIOSEQ"
(if it exists) are always the first file read in (this is done
automatically before any bioseq_* function is processed).  If this
does not suit your priority scheme, simply call bioseq_read with the
"BIOSEQ" environment variable value in its proper position in the
priority scheme.  Those entries will override the previously read
entries.

<P>
(Note: This function has the capability to read a complete comma
separated list of files, but the typical use of this function will
probably be to read a single file, except when handling the "BIOSEQ"
environment variable.)


<P>
<H2><A NAME="bioseq_check">Bioseq_check</A></H2>

<blockquote>
int bioseq_check(char *dbspec)
<P>
<UL>
<LI> dbspec - a database search specifier
<P>
<LI> returns non-zero if the string refers to a known database, or
returns zero otherwise
</UL>
</blockquote>

The bioseq_check function can be used to test whether a database
search specification refers to a database known to the package (i.e.,
if a BIOSEQ entry exists for that database).


<P>
<H2><A NAME="bioseq_info">Bioseq_info</A></H2>

<blockquote>
char *bioseq_info(char *dbspec, char *fieldname)
<P>
<UL>
<LI> dbspec - a database search specifier
<LI> fieldname - the name of the information field to be returned
<P>
<LI> returns the text for that field of the BIOSEQ entry for that
database.<BR>
(NOTE:  the returned string buffer is a malloc'ed buffer, and it must
be freed by you.) 
</UL>
</blockquote>

The bioseq_info function is used to retrieve an information field
from a BIOSEQ entry.  The file 
"<A HREF="seqio_user.html#ifields">user.doc</A>" describes the BIOSEQ
entry format and how information fields can be added to an entry.  The
returned text is always stored in a malloc'ed buffer which you must free
after you've finished using.

<P>
Note that the `dbspec' argument does NOT have to be a simple database
name, but can be a fulle database search specifier.  This is so you
can use the same specification to start a database search and get
information about the database being searched, without having to
explicitly extract the database name from the specification.


<P>
<H2><A NAME="bioseq_matchinfo">Bioseq_matchinfo</A></H2>

<blockquote>
char *bioseq_matchinfo(char *fieldname, char *fieldvalue)
<P>
<UL>
<LI> fieldname - the name of an information field
<LI> fieldvalue - the value that the information field should have
<P>
<LI> returns the name of a database.<BR>
(NOTE:  the returned string buffer is a malloc'ed buffer, and it must
be freed by you.) 
</UL>
</blockquote>

The bioseq_matchinfo function is used to determine which database
contains an information field with a specific value.  (The package
uses it to find the database corresponding to a particular identifier
prefix, as in `<CODE>bioseq_matchinfo("IdPrefix", "ec")</CODE>'.)

<P>
The function traverses the list of BIOSEQ entries, looking for the
first one which has an information field that matches both the
fieldname and fieldvalue.  The fieldname and fieldvalue matching are
both case-insensitive, and any whitespace separating the fieldname and
the fieldvalue is ignored.  So, the call above would match the
information line `<CODE>&gt;IDPREFIX: EC</CODE>', regardless of how many
spaces separate the `<CODE>:</CODE>' and `<CODE>EC</CODE>'.


<P>
<H2><A NAME="bioseq_parse">Bioseq_parse</A></H2>

<blockquote>
char *bioseq_parse(char *dbspec)
<P>
<UL>
<LI> dbspec - a database search specifier
<P>
<LI> returns the list of files in a string where each file is
terminated by a newline character and the whole string is terminated
by a NULL character.<BR>
(NOTE:  the returned string buffer is a malloc'ed buffer, and it must
be freed by you.)
</UL>
</blockquote>

The bioseq_parse function parses a BIOSEQ database search
specification and determines which files of the database need to be
searched.  That list of files is then returned in a malloc'ed buffer
which you must free.  The returned string terminates each filename by
a newline character (even the last file in the list), and the list of
filenames is ended by a NULL character (as with all strings).

<P>
<HR>

<P>
<H1><A NAME="misc">Miscellaneous Functions</A></H1>

<P>
<H2><A NAME="seqfisafile">Seqfisafile</A></H2>

<blockquote>
int seqfisafile(char *filename)
<P>
<UL>
<LI> filename - a filename (with a possible "@..." single entry access specification)
<P>
<LI> returns non-zero (if the filename refers to an existing file) or zero
(if not)
</UL>
</blockquote>

The seqfisafile function can be used to test whether a user-given
filename (which may contain a single entry access specification in
addition to the actual filename) refers to an existing file.  It first
checks for the single entry access specification (by looking for a
`@'), and then checks the prefix to see if it refers to an existing
file.

<P>
What the function does not do is parse the single entry access
specification (if it's there).  So, a non-zero return value does NOT
mean that seqfopen will succeed in opening the file.


<P>
<H2><A NAME="seqfisaformat">Seqfisaformat</A></H2>

<blockquote>
int seqfisaformat(char *format)
<P>
<UL>
<LI> format - a file format string
<P>
<LI> returns non-zero (if the string is a valid file format) or zero
(if not)
</UL>
</blockquote>

The seqfisaformat function can be used to test whether a string
specifies a valid file format or not.  It checks the given string
against all of the valid format names and returns a non-zero/zero
value telling whether a match occurred.


<P>
<H2><A NAME="seqffmttype">Seqffmttype</A></H2>

<blockquote>
int seqffmttype(char *format)
<P>
<UL>
<LI> format - a file format string
<P>
<LI> returns the format type or T_INVFORMAT (for an invalid
format)
</UL>
</blockquote>

The seqffmttype function returns some type information about the given
format.  The possible returned types are T_SEQONLY, T_DATABANK,
T_GENERAL, T_LIMITED, T_ALIGNMENT and T_OUTPUT.  See file 
"<A HREF="seqio_format.html#types">format.doc</A>"
for more information about what these types mean.


<P>
<H2><A NAME="seqfcanwrite">Seqfcanwrite</A></H2>

<blockquote>
int seqfcanwrite(char *format)
<P>
<UL>
<LI> format - a file format string
<P>
<LI> returns non-zero (if that format is writeable) or zero (if not)
</UL>
</blockquote>

The seqfcanwrite function tells whether or not the given file format
has writing capabilities.  Currently, every file format except
FASTA-output has this capability.


<P>
<H2><A NAME="seqfcanannotate">Seqfcanannotate</A></H2>

<blockquote>
int seqfcanannotate(char *format)
<P>
<UL>
<LI> format - a file format string
<P>
<LI> returns non-zero (if the format's entries can be annotated) or
zero (if not)
</UL>
</blockquote>

The seqfcanannotate function tells whether or not the given file
format has annotation capabilities.  Currently, the file formats
GenBank, EMBL, Swiss-Prot, PIR, FASTA, NBRF, IG/Stanford and ASN.1 do
have this capability, and the formats Raw, Plain, FASTA-old, NBRF-old,
IG-old, PHYLIP, Clustalw and FASTA-output do not.


<P>
<H2><A NAME="seqfcangcgify">Seqfcangcgify</A></H2>

<blockquote>
int seqfcangcgify(char *format)
<P>
<UL>
<LI> format - a file format string
<P>
<LI> returns non-zero (if the format's entries can be gcgified/ungcgified) or
zero (if not)
</UL>
</blockquote>

The seqfcangcgify function tells whether or not the given file format
can be converted from and to its GCG format.  Currently, the file
formats GenBank, EMBL, Swiss-Prot, PIR, FASTA, FASTA-old, NBRF,
NBRF-old, IG/Stanford and IG/Stanford-old have this capability, and
the other formats do not.


<P>
<H2><A NAME="seqfbytepos">Seqfbytepos</A></H2>

<blockquote>
void seqfbytepos(SEQFILE *sfp)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for reading
<P>
<LI> returns a byte position, or -1 on error
</UL>
</blockquote>

The seqfbytepos function returns the byte position in the current file
of the beginning of the current entry.  If no current entry exists
(because of a previous error or because EOF was reached), the function
returns -1.


<P>
<H2><A NAME="seqfsetpretty">Seqfsetpretty</A></H2>

<blockquote>
void seqfsetpretty(SEQFILE *sfp, int value)
<P>
<UL>
<LI> sfp - a SEQFILE structure open for writing
<LI> value - either non-zero or zero
<P>
<LI> returns nothing
</UL>
</blockquote>

The seqfsetpretty function tells whether the output operations should
add some whitespace to make the sequence look prettier.  When
outputting in the Plain, FASTA, NBRF or IG/Stanford formats (and their
variants), the putseq operation looks at the sequence being output,
and may add whitespace to the outputted sequence to make it look
prettier.

<P>
By default, the extra spaces are added when the sequence is DNA, RNA
or Protein and when there are no non-alphabetic characters in the
sequence (such as alignment characters).


<P>
<H2><A NAME="seqfparseent">Seqfparseent</A></H2>

<blockquote>
SEQINFO *seqfparseent(char *entry, int entrylen, char *format)
<P>
<UL>
<LI> entry - the text of an entry
<LI> entrylen - the length of the entry
<LI> format - the format of the entry
<P>
<LI> returns a malloc'ed SEQINFO structure containing the
information about the entry.<BR>
(NOTE:  This structure must be freed by you.)
</UL>
</blockquote>

The seqfparseent function parses an entry and constructs a SEQINFO
structure containing the information occurring in the entry.  It could
be useful if you read in entries on your own, but still want to
retrieve the information stored in them.

<P>
(NOTE: This function cannot parse entries in the PHYLIP, Clustalw or
FASTA-output format.  An error is triggered if you call seqfparseent
with an entry in one of these formats.)


<P>
<H2><A NAME="asn_parse">Asn_parse</A></H2>

<blockquote>
int asn_parse(char *begin, char *end, ...)
<P>
<UL>
<LI> begin - the beginning of the ASN.1 text
<LI> end - the end of the ASN.1 text (i.e., the last character of the
ASN.1 text is at `end-1')
<LI> ... - a NULL terminated list of arguments specifying the
sub-records to be searched and the variables to store the beginning
and end positions of the sub-record text.<BR>
(NOTE:  These arguments must be given in groups of 3 until the NULL
termination, such as in
<PRE>
  "seq.id.genbank", &amp;gbstart, &amp;gbend,
  "seq.descr", &amp;destart, &amp;deend,
  NULL
</PRE>
<P>
The format for each triple is
<PRE>
  char *subrecord, char **begin_out, char **end_out
</PRE>
<P>
and either begin_out or end_out can be NULL.)
<P>
<LI> returns a count of the number of sub-records found or a -1 on
error
</UL>
</blockquote>

Since I found correctly parsing the ASN.1 text a much harder task than
the line oriented file formats, I've added this internal function to
the interface so that you might find it easier to move through the
ASN.1 records.  The description below assumes that you are familiar
with the ASN.1 text format, and in particular the structure of
"Bioseq-set.Seq-set.seq" records in the ASN.1 Bioseq-set hierarchy
defined in the NCBI toolkit.  The function itself can work with any
correctly formatted ASN.1 text, however.

<P>
The asn_parse function takes a piece of ASN.1 text (from `begin' to
`end') that specifies either one or more records in a hierarchy (i.e.,
you need not specify the top-most record, but the beginning of the
text must be at the beginning of some record in the hierarchy and
every record whose beginning starts inside the text must be complete).
The arguments that you give after `end' specify the sub-records you
are interested in, along with pointer variables which will be set to
the beginning and end of those sub-records, if they are found.

<P>
The strings naming the desired sub-records should match the structure
of sub-records in the text's hierarchy.  In most cases, the string is
simply a list of the initial keywords of the sub-records separated by
periods.  However, when a sub-record does not have an initial keyword,
but begins with an open brace, that open brace must be included in the
sub-record identifier string.  One example of this in the Bioseq 
hierarchy is the Bioseq-set.seq-set.seq.annot records (here is an
example)
<PRE>
annot {
  {
    data
      ftable {
        {
          data
            prot {
              name {
                "nifS protein" } } ,
          location
            whole
              gi 77963 } } } }
</PRE>
To access sub-records in this record, the strings must appear as
"annot.{.data" or "annot.{.data.ftable.{.location".  The open braces
after keywords are not specified, but braces without keywords are.
Also, note that in this example, the keywords "prot", "whole" and "gi"
are NOT initial keywords recognized by asn_parse, because they do not
appear at the beginning of a sub-record (i.e., after an open brace
starting a sub-record or after a comma separating sub-records).

<P>
In the parameter description above (the `...' description), I'm assuming
that the user gives asn_parse the text for a Bioseq `seq' record (see the
NCBI toolkit documentation for the structure of this record) and wants to
get the "id.genbank" and "descr" sub-records.  As another example, here
is an excerpt from the SEQIO function implementing the seqfinfo operation
for the ASN.1 file format.  The first part of it looks for identifiers
and accession numbers, and the second looks for comments.
<PRE>
  /*
   * Find the "id" and "descr" sub-records in the "seq" record.
   */
  idstr = destr = NULL;
  status = asn_parse(entry, entry + entrylen,
                     "seq.id", &amp;idstr, &amp;idend,
                     "seq.descr", &amp;destr, &amp;deend,
                     NULL);
  if (status == -1)
    /* error handling code here */

  /*
   * If there was an "id" sub-record, look for all of the possible
   * sub-records that specify database identifiers or accession
   * numbers.
   */ 
  if (idstr != NULL) {
    pirname = spname = gbname = emname = otname = prfname = dbjname = NULL;
    piracc = spacc = gbacc = emacc = otacc = prfacc = dbjacc = NULL;
    pdbmol = gistr = giim = gibbs = gibbm = NULL;
    status = asn_parse(idstr, idend,
                       "id.pir.name", &amp;pirname, &amp;pnend,
                       "id.swissprot.name", &amp;spname, &amp;spend,
                       "id.genbank.name", &amp;gbname, &amp;gbend,
                       "id.embl.name", &amp;emname, &amp;emend,
                       "id.ddbj.name", &amp;dbjname, &amp;dbjend,
                       "id.prf.name", &amp;prfname, &amp;prfend,
                       "id.other.name", &amp;otname, &amp;otend,
                       "id.pir.accession", &amp;piracc, &amp;paend,
                       "id.swissprot.accession", &amp;spacc, &amp;spaend,
                       "id.genbank.accession", &amp;gbacc, &amp;gbaend,
                       "id.embl.accession", &amp;emacc, &amp;emaend,
                       "id.ddbj.accession", &amp;dbjacc, &amp;dbjaend,
                       "id.prf.accession", &amp;prfacc, &amp;prfaend,
                       "id.other.accession", &amp;otacc, &amp;otaend,
                       "id.pdb.mol", &amp;pdbmol, &amp;pmend,
                       "id.gi", &amp;gistr, &amp;gisend,
                       "id.giim.id", &amp;giim, &amp;giimend,
                       "id.gibbsq", &amp;gibbs, &amp;gibbsend,
                       "id.gibbmt", &amp;gibbm, &amp;gibbmend,
                       NULL);
    if (status == -1)
      /* error handling code here */

    /* 
     * If the PIR identifier is found, extract it from the string
     * pirname+4 (+4 to skip the "name" keyword) to pnend using 
     * internal function `add_id'.
     */
    if (pirname != NULL)
      add_id(&amp;info, "pir", pirname+4, pnend);

    ...
  }

  /*
   * If the "description" sub-record exists, look for a "comment"
   * sub-record.
   */
  if (destr != NULL) {
    comment = NULL;
    status = asn_parse(destr, deend,
                       "descr.comment", &amp;comment, &amp;cmend,
                       NULL);
    if (status == -1)
      /* error handling code here */

    /*
     * If that first comment was found, use internal function
     * `add_comment' to add it to the SEQINFO structure, and then
     * look for other comments in the "descr" record, since there
     * can be more than one "comment" sub-record.
     *
     * Note the first two arguments to the asn_parse call are
     * `cmend+1' and `deend'.  So, these searches start from just
     * after the end of the last found comment and run until the
     * next "comment" record is found, or until the end of the
     * "descr" record.
     */
    if (comment != NULL) {
      add_comment(&amp;info, comment+7, cmend, 1);
      while (asn_parse(cmend+1, deend,
                       "comment", &amp;comment, &amp;cmend,
                       NULL) == 1)
        add_comment(&amp;info, comment+7, cmend, 1);
    }
  }
</PRE>
A couple of notes.  First, if both the beginning and end pointer
variables are specified (such as `&amp;comment' and `&amp;comend'
above), then either both variables will be set to a value if the
sub-record is found, or both will be left unchanged if the sub-record
is not found.  Thus, in the example above, I only needed to set and
test whether the beginning pointer variable was NULL, and did not need
to worry about the value of the end pointer variable (if the
beg. variable had been changed, then I was guaranteed that the end
variable had also been changed).

<P>
Second, the function returns the very beginning of the record it
is searching for, including the initial keyword starting the record.
So, when using the beginning and end variable values from one search
as the `begin' and `end' parameters to another search, the record
specifications for that second search must begin with that initial
keyword.  In the example above, the first search looked for "seq.id"
and "seq.descr", and the later searches then looked for "id.pir.name"
and "descr.comment".

<P>
Finally, the return value of the function is the count of the number
of sub-records found, or a -1 if a parse error occurring while
scanning the text.

<P>
<HR>

<P>
<H1><A NAME="errors">Error Handling/Reporting</A></H1>

<P>
<H2><A NAME="seqferrno">Seqferrno</A>,
<A NAME="seqferrstr">Seqferrstr</A></H2>

<blockquote>
extern int seqferrno;<BR>
extern char seqferrstr[];
</blockquote>

These are variables which are set whenever an error occurs in the
SEQIO package.  seqferrno gives a numerical error similar to Unix's
errno, and seqferrstr holds the error string that the SEQIO package
would have output (or perhaps did output depending on the error
policy).

<P>
The values that seqferrno can have are as follows:
<DL>
<DT> E_EOF (-1)
<DD> Reached the end of the file or database search.
<DT> E_NOERROR (0)
<DD> There is no error (the default value).
<DT> E_OPENFAILED (1)
<DD> An error occurred when opening a file.
<DT> E_READFAILED (2)
<DD> An error occurred when reading a file.
<DT> E_NOMEMORY (3)
<DD> Ran out of memory.
<DT> E_PROGRAMERROR (4)
<DD> A bug was detected in the SEQIO package itself.
<DT> E_PREVERROR (5)
<DD> A previous error occurred which does not permit the current
operation to succeed.
<DT> E_PARAMERROR (6)
<DD> An invalid parameter passed to a SEQIO function.
<DT> E_INVFORMAT (7)
<DD> An unknown file format was specified as the
argument to a function (like seqfopen).
<DT> E_DETFAILED (8)
<DD> The format of a file could not be automatically
determined from its contents.
<DT> E_PARSEERROR (9)
<DD> A parse error occurred while scanning a file.
<DT> E_DBPARSEERROR (10)
<DD>  A parse error occurred while scanning a BIOSEQ entry or a
database search specification.
<DT> E_DBFILEERROR (11)
<DD>  The BIOSEQ entry for a database specifies some files which could
not be located or opened.
<DT> E_NOSEQ (12)
<DD>  A sequence entry does not contain a sequence. (This error only
occurs if the sequence is actually requested, not for every entry
without a sequence.) 
<DT> E_DIFFLENGTH (13)
<DD>  There is a discrepancy between the length of a sequence, as
specified in the sequence entry, and the number of characters found
for the sequence. 
<DT> E_INVINFO (14)
<DD>  When outputting an entry, one of the fields in the SEQINFO
structure used to generate the output contains invalid information
(i.e., it is not formatted correctly).
<DT> E_FILEERROR (15)
<DD>  A parse error occurred while parsing a single entry access
specification that was given with a filename to seqfopen.
</DL>
   

<H2><A NAME="seqfperror">Seqfperror</A></H2>

<blockquote>
void seqfperror(char *s)
<P>
<UL>
<LI> s - a string (usually the program name) to be printed before the
error string
<P>
<LI> returns nothing
</UL>
</blockquote>

The seqfperror function is similar to that of Unix's perror function.
It outputs to standard error the text of the last error message (i.e.,
the contents of seqferrstr).  If the argument to the function is not
NULL, then that argument string and the string ": " are first output.
This argument is typically used to output the program name before the
error message.


<H2><A NAME="seqfsetperror">Seqfsetperror</A></H2>

<blockquote>
void seqfsetperror(void (*perr_fn)(char *))
<P>
<UL>
<LI> perr_fn - a void function that takes a string as its argument
<P>
<LI> returns nothing
</UL>
</blockquote>

The seqfsetperror function can be used to replace the default method
the SEQIO package has for reporting errors.  When an error is detected
and the error policy is such that an error message should be output, a
default print error function is used to output the error message to
stderr.  This function resets the print error function to either one
of your own choosing (if the argument is not NULL), or back to the
default print error function (if the argument is NULL).


<P>
<H2><A NAME="seqferrpolicy">Seqferrpolicy</A></H2>

<blockquote>
int seqferrpolicy(int pe)
<P>
<UL>
<LI> pe - sets the error policy
<P>
<LI> returns the old error policy
</UL>
</blockquote>

The seqferrpolicy function set an error policy for the SEQIO package
to follow when it detects an error has occurred.  The SEQIO package
detects three kinds of errors:
<P>
<OL>
<LI>
Warnings - when the package detects that something is wrong, but can
still complete the current operation and return an actual value.

<P>
Examples of warnings are E_DIFFLENGTH, where a sequence is found but
its length may be incorrect, or E_DETFAILED, where the file format
can't be determined and the Plain format is used.

<P>
On a warning, an error message is output.

<P>
The warning errno values are: E_DETFAILED, E_DBFILEERROR(sometimes),
E_DIFFLENGTH.

<P>
<LI>
Errors - when the package is unable to complete the current operation
because of some failure, but this failure does not adversely affect
the state of the SEQFILE structures (so the SEQIO package can keep
going in later calls).

<P>
Examples of errors are E_PARAMERROR, when an invalid parameter is
given to a function, or E_NOSEQ, where no sequence can be returned
since no sequence occurs in the current entry.

<P>
The one variation on the handling of these errors (this operation
cannot finish, but future operations using that SEQFILE structure are
okay) is when an E_READFAILED or E_PARSEERROR is detected while
reading a file.  In those cases (i.e., on the first parse error or
when the file can't be read), the package stops reading that file and
returns an error result.  However, what happens to the next call to
read the SEQFILE structure depends on whether seqfopen was called to
read a single file or whether seqfopendb was called to read a number
of files.  If seqfopen was called, the next call to read another entry
or sequence will get an EOF signal, whereas when a database is being
read, the package will move on to the next file in the database and
attempt to read the entries from there.  So, the result of the next
call to seqfread, seqfgetseq, seqfgetentry or seqfgetinfo after a
read/parse error depends on whether there are any more files to read.

<P>
On an error, an error message is output and an error value is returned
as the result of the called function.

<P>
The error errno values are: E_EOF, E_OPENFAILED, E_READFAILED,
E_PREVERROR, E_PARAMERROR, E_INVFORMAT, E_PARSEERROR, E_DBPARSEERROR,
E_DBFILEERROR(sometimes), E_NOSEQ, E_FILEERROR.
                                       
<P>
<LI>
Fatal errors - when the error leaves the state of the SEQFILE
structure (or the SEQIO package) in an unrecoverrable state.

<P>
No more operations can be performed using that SEQFILE structure, and
you may only call seqfclose to close the structure (if the program
does not exit on the error).  If further calls are made, the error
status E_PREVERROR is always returned.

<P>
Examples of fatal errors are E_NOMEMORY where the package runs out of
memory, or E_PROGRAMERROR when an internal program bug is detected

<P>
On a fatal error, an error message is output, the system function
`exit' is called (unless disabled using seqferrpolicy), and an error
value is returned as the result of the function (if the exit call is
disabled).

<P>
The fatal errno values are:  E_NOMEMORY, E_PROGRAMERROR.
</OL>

How these errors are handled can be determined by which of the following
predefined constants is passed to seqferrpolicy:
<DL>
<DT> PE_NONE
<DD> Disable all warning/error/fatal message output and all
calls to exit.  So, only seqferrno and seqferrstr are set
on an error (and possibly an error value is returned).
<DT> PE_WARNONLY
<DD> Allow warning messages to be printed, but disable
error/fatal messages and calls to exit.
<DT> PE_ERRONLY
<DD> Allow error/fatal messages, but disable warning messages
and calls to exit.
<DT> PE_NOWARN
<DD> Allow error/fatal messages and calls to exit, but
disable warning messages.
<DT> PE_NOEXIT
<DD> Allow all message output, but disable calls to exit.
<DT> PE_ALL
<DD> Allow all message output and calls to exit.
</DL>
The default policy is PE_ALL, where all output is performed and fatal
errors cause the SEQIO package to call exit.

<P>
The function returns the old error policy as its return value.  So,
if you want to turn off all error reporting for a SEQIO function call,
you can do the following:
<PRE>
old_pe = seqferrpolicy(PE_NONE);

/* Do the SEQIO function call */

seqferrpolicy(old_pe);

if (seqferrno != E_NOERROR &amp;&amp; seqferrno != E_EOF) {
  /* Handle the error/warning that occurred during the function call */
}
</PRE>

<HR>
<ADDRESS> 
<a href="http://wwwcsif.cs.ucdavis.edu/~knight">James R. Knight,</a>
<a href="mailto:knight@cs.ucdavis.edu">knight@cs.ucdavis.edu</a><BR>
June 26, 1996
</ADDRESS>
</BODY>
